The Habits of Goodreads Users

An Exploration of Ratings, Young Adult Novels, Tags, and More
Data Science 1 with R (STAT 301-1)

Author

Valerie Chu

Published

December 6, 2023

Introduction

What is Goodreads?

This is how Goodreads describes itself: “Goodreads is the world’s largest site for readers and book recommendations. Our mission is to help readers discover books they love and get more out of reading. Goodreads launched in January 2007.”

Among other unique features, Goodreads allows users to:

  • Rate books

  • Write book reviews

  • Track and tag the books they’re reading, have read, and want to read

Data Overview and Quality

What is my data about?

This dataset contains six million ratings for the 10,000 most rated books on Goodreads.

It was last updated on Sept. 19, 2017, so books published after that date won’t appear in this dataset.

It comes in five separate CSV files: “books”, “books_tags”, “ratings”, “tags”, and “tbr”.

(For context, users on Goodreads can tag books and add them to their shelves. And “tbr” stands for “to be read”.)

In this document, when I say “Goodreads data” or “Goodreads dataset”, I am referring to the five datasets generally. For individual datasets, I will use their specific names.

(All data is in the “data” folder. I didn’t have to clean any of the five original datasets, so the folder holds no separate cleaned versions.)

Why am I interested in this dataset?

The reason why I’m interested in this dataset is extremely simple: I love reading. Analyzing Goodreads data / Goodreads user data seems super fun.

More specifically, I’m interested in analyzing user habits on Goodreads, which is why I chose this dataset. I spent several hours combing through Kaggle, Google, Reddit, and other places on the internet. As far as I can tell, this is the most comprehensive freely available dataset about books and user reading habits. It was scraped from publicly available Goodreads and Goodreads user data.

More about the data

The Goodreads dataset is of high quality.

There are no missing values in any of the five datasets included in the Goodreads dataset except for “books”, which is missing some values in language_code, isbn, isbn13, original_title, and original_publication_year. This will not affect my exploratory data analysis, although I will keep the missingness in mind. (See “01_Exploring_Missingness.R” for more.)
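A missingness check of this kind can be sketched in a few lines of base R. The toy `books` data frame below is invented for illustration. It is not the real file, and this is not necessarily the code in “01_Exploring_Missingness.R”. Only the column names are taken from the list above.

```r
# Toy stand-in for the "books" data; all values are made up.
books <- data.frame(
  book_id        = 1:4,
  isbn           = c("0000000001", NA, "0000000003", NA),
  language_code  = c("eng", "en-US", NA, "eng"),
  original_title = c("Title A", "Title B", "Title C", "Title D")
)

# is.na() returns a TRUE/FALSE matrix; colSums() counts the TRUEs per column.
missing_counts <- colSums(is.na(books))

# Keep only the columns that actually have gaps.
columns_with_gaps <- missing_counts[missing_counts > 0]
print(columns_with_gaps)
```

Running the same two lines on each of the five datasets would show which columns, if any, contain missing values.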

The “books” dataset:

  • There are 10,000 observations.

  • There are 23 variables.

  • There are 17 numerical variables.

  • There are 6 categorical variables.
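One way to arrive at counts like these is sketched below in base R. The data frame is a small invented example, and the sketch assumes that every non-numeric column counts as categorical, which may not match how the variables were classified for this report.

```r
# Toy stand-in for the "books" data; all values are made up.
books <- data.frame(
  book_id        = 1:3,
  title          = c("Title A", "Title B", "Title C"),
  language_code  = c("eng", "en-US", "eng"),
  average_rating = c(4.1, 3.9, 4.4)
)

# TRUE for numeric columns, FALSE otherwise.
is_numeric_column <- vapply(books, is.numeric, logical(1))

overview <- c(
  observations = nrow(books),
  variables    = ncol(books),
  numerical    = sum(is_numeric_column),
  categorical  = sum(!is_numeric_column)
)
print(overview)
```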

The other four datasets can be combined with each other and with the “books” dataset in ways that can enhance my exploration of Goodreads data and user habits.

My Objectives

There are several questions I’m interested in examining in Goodreads data.

Some of these questions include:

  • What are the most highly rated books?

  • Do people hate-rate or love-rate books?

  • What’s the distribution of average ratings?

  • What are the most common tags?

  • What years were young adult books published in?

I divided my exploration of the Goodreads dataset into two parts. The first part looks at the rating habits of Goodreads users. The second part focuses on how Goodreads users interact with young adult (YA) novels on Goodreads.

Joining Data

Note

The joining step took a lot of exploration because the README that came with this dataset was unclear in places. I made a qmd in this project called “00_Joining_Datasets” that walks through my process of joining these datasets in more depth.

Step 1: Joining tags and books_tags

Why am I joining tags and books_tags?

When I join the “books_tags” dataset together with the “tags” dataset, I can figure out what each book was tagged with.
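As a rough sketch of this step, the base R example below joins two toy tables on a shared tag identifier. The column names (`goodreads_book_id`, `tag_id`, `count`, `tag_name`) and all values are assumptions made for illustration. The full process is in “00_Joining_Datasets”, which may differ from this.

```r
# Toy stand-ins for "books_tags" and "tags"; all values are made up.
books_tags <- data.frame(
  goodreads_book_id = c(1, 1, 2),
  tag_id            = c(10, 20, 10),
  count             = c(500, 40, 300)
)
tags <- data.frame(
  tag_id   = c(10, 20, 30),
  tag_name = c("to-read", "young-adult", "fantasy")
)

# Left join: keep every book-tag row and attach the matching tag_name.
books_and_tags <- merge(books_tags, tags, by = "tag_id", all.x = TRUE)
print(books_and_tags)
```

A left join is used here so that no book-tag row is dropped, even if a tag identifier had no match in “tags”.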

Step 2: Joining books and ratings

Why am I joining books and ratings?

When I join the “books” dataset with the “ratings” dataset, I can see the rating each user gave each book.
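The same idea applies to this second join. In the base R sketch below, the shared key `book_id` and the other column names are assumptions, and the rows are invented.

```r
# Toy stand-ins for "books" and "ratings"; all values are made up.
books <- data.frame(
  book_id = c(1, 2),
  title   = c("Book A", "Book B")
)
ratings <- data.frame(
  user_id = c(7, 7, 8),
  book_id = c(1, 2, 1),
  rating  = c(5, 3, 4)
)

# Inner join: one row per (user, book) rating, with the book's title attached.
books_and_ratings <- merge(ratings, books, by = "book_id")
print(books_and_ratings)
```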

Now, for the fun part. An EDA with our two new datasets, “books_and_tags” and “books_and_ratings”. And some of the original datasets too.

Exploration 1: The Rating Habits of Goodreads Users

Part 1: Which 10 books have the highest average Goodreads rating?

In an exploration of the rating habits of Goodreads users, there seems to be an obvious place to start: What are the most highly-rated books on Goodreads?
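A ranking like this comes down to sorting by average rating and keeping the first rows. The base R sketch below uses invented titles and ratings, and keeps the top 3 rather than the top 10 so the toy data stays small.

```r
# Toy stand-in for the "books" data; all values are made up.
books <- data.frame(
  title          = c("Book A", "Book B", "Book C", "Book D", "Book E"),
  average_rating = c(4.12, 4.77, 3.95, 4.61, 4.30)
)

n_top <- 3  # the report uses 10

# Sort from highest to lowest average rating, then keep the first n_top rows.
ranked    <- books[order(books$average_rating, decreasing = TRUE), ]
top_books <- head(ranked, n_top)
print(top_books)
```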

10 Books with the Highest Average Goodreads Rating

Table 1: 10 Books with the Highest Average Goodreads Rating

I’m surprised at some of the ratings of these books, yet very much not surprised at others. These are some things we can see from Table 1:

  • “The Complete Calvin and Hobbes” is more geared toward middle-grade readers, so it’s odd that it has the highest rating.

  • But I’m not surprised that “Words of Radiance” had the second highest average Goodreads rating. I know that the Goodreads demographic tends to be young adults and book bloggers, most of whom enjoy reading and rating young adult fiction on Goodreads. That’s a category this book falls within.

  • I definitely expected a Harry Potter book to be one of Goodreads’s highest-rated books, but not necessarily a boxed set. (I should note that Goodreads treats boxed sets as individual books. I couldn’t filter them out with string functions or with tags, since the naming conventions vary from boxed set to boxed set and closely resemble individual book titles.) But just from being a casual Goodreads user, I’ve observed that boxed sets tend to get higher ratings than the individual books in a series. So in a way, it is unsurprising that the trend holds true here.

  • The “ESV Study Bible” is also something I expected to see on this list, although I expected it to be ranked higher. The Bible is an important book for many people.

  • I am surprised that various spinoffs of Calvin and Hobbes and Harry Potter dominate 7 of 10 rankings in the top 10 most highly rated books. I knew they were popular, but not that popular. But I guess Goodreads is also an American company, and those are the types of books Americans read and love.

Part 2: Do people hate-rate or love-rate books?

After an exploration of the most highly-rated books on Goodreads, this question came to my mind.

As someone who tends to rate books that I love more than books that I’m neutral about or dislike, I’m curious about whether other people also have the same tendencies as me, and whether this shows in the 10,000 most rated books on Goodreads as of 2017.

If there actually is a love-rate or hate-rate trend, there would be either a positive or a negative correlation between book popularity and average rating. After all, the most promoted and most popular books are likely to make their way into this dataset of most rated books. So are those ratings good or bad?

In other words, is there a relationship between a book’s number of ratings (i.e., popularity) and its average rating?

Figure 1: Do people tend to rate popular books more highly?

Figure 1 shows that there appears to be no relationship between a book’s average rating and its rank based on the number of ratings it received.

The correlation in Table 2 confirms this: the correlation between rank_of_number_ratings and average_rating is about -0.09, which is close to 0.

Table 2: The correlation between a book’s rank and its average rating
                        average_rating  rank_of_number_ratings
average_rating               1.0000000              -0.0863433
rank_of_number_ratings      -0.0863433               1.0000000
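A correlation table of this shape can be produced with `rank()` and `cor()`. The sketch below is only an illustration: the `ratings_count` column name and the five rows are assumptions, and `cor()` is used with its default Pearson method, which may or may not be what produced Table 2.

```r
# Toy stand-in for the "books" data; all values are made up.
books <- data.frame(
  ratings_count  = c(5000, 1200, 800, 300, 100),
  average_rating = c(4.1, 3.8, 4.4, 3.9, 4.2)
)

# Rank 1 goes to the book with the most ratings.
books$rank_of_number_ratings <- rank(-books$ratings_count, ties.method = "min")

# Pairwise correlations between the two columns of interest.
correlation_table <- cor(books[, c("average_rating", "rank_of_number_ratings")])
print(correlation_table)
```

The off-diagonal entries are the correlation of interest; values near 0 mean that popularity rank says little about average rating.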

I’m very surprised at this finding. I had expected the rank of the number of ratings a book received to have at least some correlation with the average rating.

So, it seems people are neither more nor less inclined to rate a book based on whether they hated or loved that book.

Part 3: How many books and what percent of books on Goodreads are written in English?

Figure 1 was so green because most of these books are written in some variety of English. Unsurprisingly, the people who use Goodreads, an American company, mostly rate books written in English and other European languages.

So, how many books and what percent of books on Goodreads are written in English?

Table 3: How many books and what percent of books on Goodreads are written in English?

Table 3 reveals both some surprising and some unsurprising data:

  • The green-ness of Figure 1 already told us to expect that most books on Goodreads would be written in some variety of English. However, I was surprised at the sheer extent of books written in English: 63.41% of books have the language code “eng”, while 20.7% of books were written in American English and 2.57% were written in British English.

  • I did not expect that “Other” would make up 10.84% of the 10,000 most rated books on Goodreads. That’s a lot of other languages.

  • I was also surprised that Arabic was the 5th most popular language on Goodreads, after several varieties of English and “Other”. I instead expected European languages to dominate the Top 10 most popular languages on here since Goodreads is an American company, but Arabic made the list.
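Counts and percentages like those in Table 3 can be sketched with `table()`. The language codes below are invented, and the sketch makes one assumption worth flagging: `table()` drops missing codes by default, so the percentages here are shares of books with a known language code. The report’s table also groups rarer languages into “Other”, which this sketch does not do.

```r
# Toy stand-in for the language_code column; all values are made up.
books <- data.frame(
  language_code = c("eng", "eng", "eng", "eng", "eng",
                    "en-US", "en-US", "en-GB", "ara", "fre")
)

# Count books per language code, most common first.
counts <- sort(table(books$language_code), decreasing = TRUE)

language_summary <- data.frame(
  language_code = names(counts),
  n_books       = as.vector(counts),
  percent       = round(100 * as.vector(counts) / sum(counts), 2)
)
print(language_summary)
```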

Part 4: What’s the distribution of book ratings?

While working on Parts 1-3, this question naturally came up. We already know that most books on Goodreads are published in English, and that there tends to be no correlation between a book’s number of ratings and its average rating. But let’s take a step back and dive deeper into what book ratings on Goodreads tend to look like.

When I look at a Goodreads rating, I don’t think of the rating scale as continuous. I think about ratings in bins of 0.25.

For example, a book with an average rating greater than 4.5 stars is excellent. A book with an average rating of 4.75 or higher is practically unheard of. And a book with an average rating between 4.0 and 4.25 is great.

So that’s why instead of using a histogram, I’m going to put average ratings into bins of 0.25 and create a bar plot that will display the distribution of book ratings on Goodreads in a way that’s intuitive to think about.
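Binning of this kind can be sketched with `cut()`. The ratings below are invented, and the break points (2 to 5 in steps of 0.25) are an assumption. With the default `right = TRUE`, each bin is open on the left and closed on the right, so a rating of exactly 4 lands in the bin labelled (3.75,4].

```r
# Toy average ratings; all values are made up.
average_rating <- c(2.47, 3.10, 3.90, 4.00, 4.01, 4.20, 4.60, 4.77)

# Bin edges every 0.25 stars.
breaks <- seq(2, 5, by = 0.25)

# Assign each rating to its bin; labels look like "(4,4.25]".
bins <- cut(average_rating, breaks = breaks)

# Count books per bin, dropping bins that are empty.
bin_counts <- table(droplevels(bins))
print(bin_counts)
```

Passing `bin_counts` to `barplot()` would then draw one bar per bin.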

Distribution of Book Ratings

Table 4: Distribution of Book Ratings
Average Rating Number of Books
(2.25,2.5] 1
(2.5,2.75] 1
(2.75,3] 12
(3,3.25] 66
(3.25,3.5] 275
(3.5,3.75] 1189
(3.75,4] 3269
(4,4.25] 3695
(4.25,4.5] 1363
(4.5,4.75] 124
(4.75,5] 5

Table 4 shows the distribution of book ratings in numbers. Let’s visualize it.

Figure 2: The Distribution of the Average Rating of Books on Goodreads

These are a few things I found interesting about the distribution of the average rating of books on Goodreads, as seen in Figure 2:

  • It’s left skewed and unimodal.

  • The largest group of books (3695 of them, as seen in Table 4) has a rating between 4 and 4.25.

  • Another 3269 books (see Table 4) have a rating between 3.75 and 4.

  • 1 book has an average rating between 2.25 and 2.5. 1 book has a rating between 2.5 and 2.75.

  • 5 books have an average rating between 4.75 and 5.0.

So clearly, there are very few books in the 10,000 most rated books on Goodreads with an average rating of less than 3. And there are very few books with an average rating of more than 4.75.

That suggests that people who rate the popular books on Goodreads usually do one of three things:

    1. Like the book enough that they went to Goodreads and rated it decently highly (a 3, 4, or 5).
    2. Avoid rating extremely high or low.
    3. Rate decently high in large enough numbers to pull the average up, outweighing the people who rate books low.

Without more data, it’s hard to tell whether we can explain away the cluster around book ratings between 3.75 and 4.25 as one of these suggestions, a combination of these suggestions, or none of these suggestions. It’s still fun to think about though.

Also, I should note again that this dataset contains only the most rated (i.e., popular) books on Goodreads. In that context, it makes sense that books with lower ratings likely aren’t promoted enough, and therefore likely aren’t rated enough, to appear in this dataset.

Part 5: Do readers who leave the most ratings leave higher or lower ratings on average?

Table 4 and Figure 2 showed us the distribution of book ratings, while Figure 1 explored whether people love-rate or hate-rate books. But is there a trend in the number of books people have read and how they rate books? Does reading more books make people set their expectations higher and thus rate books lower on average, or is it maybe the other way around?

Part 5, Section 1: Who are the readers who leave the most ratings on Goodreads?

First, let’s start by seeing which readers leave the most ratings on Goodreads.

Each user (user_id) can give each book one rating. There are 53,424 users who rated the 10,000 most rated books on Goodreads. These are the top 10 raters.

Top 10 Goodreads Users Who Rated the Most Books in this Dataset

Table 5: The 10 Goodreads Users Who Rated the Most Books in this Dataset
user_id count
12874 200
30944 200
12381 199
28158 199
52036 199
6630 197
45554 197
7563 196
9668 196
9806 196
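A table like Table 5 comes from counting rows per user_id. The base R sketch below uses an invented “ratings” table, and the `book_id` and `rating` column names are assumptions.

```r
# Toy stand-in for the "ratings" data; all values are made up.
ratings <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 3),
  book_id = c(10, 11, 12, 10, 12, 11),
  rating  = c(5, 4, 3, 4, 2, 5)
)

# How many distinct users left at least one rating?
n_users <- length(unique(ratings$user_id))

# One row per user with the number of books they rated.
rating_counts <- as.data.frame(table(user_id = ratings$user_id),
                               responseName = "count")

# Most active raters first; keep at most 10.
top_raters <- head(rating_counts[order(rating_counts$count, decreasing = TRUE), ], 10)
print(top_raters)
```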

Table 5 shows that the users who rated the most books in this dataset all rated around 200 books. If we also keep in mind that this dataset only has the data of the 10,000 most rated books on Goodreads, having users who rated 200 of these books is quite impressive.

However, I did expect this number to be higher, especially since I know the people who tend to use Goodreads are bookworms who can read dozens of books per year. But since Goodreads was launched in January 2007 and might not have gained popularity until more people had access to technology, this could make sense.

Part 5, Section 2: How do top raters tend to rate books?

Now that we know which Goodreads users rated the most books, it logically follows for us to examine whether there’s a trend in how top raters tend to rate books.

This graph shows the relationship between a user’s rank (based on the number of books they rated) and the average rating they gave. The colors are pretty evenly spread across the graph, and there seems to be no trend in how top raters rate books.

Figure 3: Do Top Raters Give Higher Or Lower Average Ratings?

Let’s confirm this with a correlation table.

Table 6: The correlation between a user’s rank and their average rating
                     count          id  average_rating
count            1.0000000  -0.0634094      -0.0825340
id              -0.0634094   1.0000000      -0.0150214
average_rating  -0.0825340  -0.0150214       1.0000000
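The per-user comparison can be sketched by summarising each user’s ratings and then correlating the two summaries. The rows below are invented and the column names are assumptions. The toy data is far too small to say anything about real users; it only shows the mechanics.

```r
# Toy stand-in for the "ratings" data; all values are made up.
ratings <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 3),
  rating  = c(5, 4, 3, 4, 2, 5)
)

# Per user: how many books they rated and the average rating they gave.
count          <- tapply(ratings$rating, ratings$user_id, length)
average_rating <- tapply(ratings$rating, ratings$user_id, mean)

per_user <- data.frame(
  user_id        = names(count),
  count          = as.vector(count),
  average_rating = as.vector(average_rating)
)

# Pearson correlation between activity and average rating.
user_correlation <- cor(per_user$count, per_user$average_rating)
print(user_correlation)
```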

In retrospect, I’m not surprised Figure 3 and Table 6 show that there’s no correlation between how many books someone rated and whether they had a higher or lower average rating.

Table 2 already found that there was no correlation between a book’s average rating and its rank based on the number of ratings it received. It makes sense that if there’s no relationship between number of ratings and average rating, top Goodreads users wouldn’t differ from this average trend either.

Exploration 2: Young Adult Novels on Goodreads

The young adult (YA) genre is one I’ve always enjoyed reading. Many people who use Goodreads, such as book bloggers and book YouTubers, also tend to prefer the YA genre. The YA genre is quite popular on Goodreads.

This is why YA novels are my second area of exploration for this Goodreads dataset.

Part 1: What are the most common tags?

I know that the YA genre is quite popular on Goodreads, but I was also curious what the most popular tags were.

For context, users on Goodreads can tag books and add them to their bookshelves. It’s basically a way to personalize the labeling of books.

The most common Goodreads tags

Table 7: What are the most common tags?

Table 7 reveals some interesting trends in Goodreads data:

  • “to-read” is, of course, the most popular tag. Goodreads basically lets users use it as a default tag. Same for “favorites”.

  • The first dozen or so tags are all plain, simple tags that resemble labels. They were to be expected. However, I was surprised that “read-in-2015”, “read-in-2016”, and “read-in-2014” made the list of the top 30 most popular tags. People must love categorizing their bookshelves by date, which I did not expect.

  • “fiction” and “adult” were some of the first few genre tags to appear. And of course, “fantasy” and “romance” made an appearance at ranks 37 and 38 respectively.

  • And “young-adult” debuted at rank 49, just before “english”.
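A tag ranking like Table 7 can be sketched by totalling a usage count per tag and sorting. Everything below is illustrative: the rows are invented, the column names are assumptions, and measuring a tag’s popularity by the sum of `count` (rather than, say, the number of books carrying the tag) is also an assumption.

```r
# Toy stand-in for the joined "books_and_tags" data; all values are made up.
books_and_tags <- data.frame(
  goodreads_book_id = c(1, 1, 2, 2, 3),
  tag_name          = c("to-read", "young-adult", "to-read", "fantasy", "to-read"),
  count             = c(500, 40, 300, 25, 120)
)

# Total usage per tag across all books.
tag_totals <- aggregate(count ~ tag_name, data = books_and_tags, FUN = sum)

# Most used tags first, with an explicit rank column.
tag_totals      <- tag_totals[order(tag_totals$count, decreasing = TRUE), ]
tag_totals$rank <- seq_len(nrow(tag_totals))
print(tag_totals)
```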

Part 2: Which are the highest rated young adult (YA) books?

Now that I knew which were the most popular tags, I was curious which were the most highly rated YA books. Let’s explore that.

100 Most Highly Rated YA Books

Table 8: 100 Most Highly Rated YA Books

Much like for Table 1, I was unsurprised by the sheer number of Harry Potter and Calvin and Hobbes books that made the list of the 100 most highly rated YA books.

Goodreads is, after all, based in America.

But both lists were remarkably similar in their top picks. So perhaps YA books tend to get higher average ratings than most other genres.

Part 3: What years were young adult (YA) books published in?

This is something I’ve always been curious about. YA novels are, by name, young adult novels. They were written for young adults. And that’s a trend that has been going on for a long time.

Naturally, I was curious about how many books have been tagged by Goodreads users as YA books, and how old those books are.

Figure 4: What years were YA books published in?

It’s quite funny to me that books published as far back as -1750 have been categorized by Goodreads users as YA books. But that’s the beauty of the genre. It continues to evolve.
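Selecting YA books and checking their publication years can be sketched as below. The sketch assumes a book counts as YA if any user tagged it “young-adult”, which would also explain how very old books end up in the group. The `goodreads_book_id` key and all rows are invented for illustration; negative years stand for years BCE.

```r
# Toy stand-ins for "books" and the joined "books_and_tags"; values are made up.
books <- data.frame(
  goodreads_book_id         = c(1, 2, 3, 4),
  title                     = c("Book A", "Book B", "Book C", "Book D"),
  original_publication_year = c(2008, -750, 1997, 2012)
)
books_and_tags <- data.frame(
  goodreads_book_id = c(1, 2, 3, 3, 4),
  tag_name          = c("young-adult", "young-adult", "fantasy", "young-adult", "fiction")
)

# Identifiers of every book that carries the "young-adult" tag at least once.
is_ya_tag <- books_and_tags$tag_name == "young-adult"
ya_ids    <- unique(books_and_tags$goodreads_book_id[is_ya_tag])

# Keep only those books, then look at the span of publication years.
ya_books      <- books[books$goodreads_book_id %in% ya_ids, ]
ya_year_range <- range(ya_books$original_publication_year, na.rm = TRUE)
print(ya_year_range)
```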

Part 4: Which YA books have more ratings?

There is, of course, one final question I need to know the answer to — the question that started this Goodreads data exploration when I searched around for the next unique YA book I should check out and realized there was a more efficient way to go about this than asking everyone I knew.

Perhaps I should have simply created a scatterplot to see what years I was missing from my YA repertoire.

Figure 5: Which YA books have more ratings?

Which YA books have the most ratings? It would appear that the years after 1750 won. But first, maybe I should start from YA novels published in the year -1750 and work my way forward in time.

Conclusion

For anyone who loves books, this Goodreads dataset holds a treasure trove of information.

Through these data explorations, I answered everything I’ve ever wondered about Goodreads user habits, user tag trends, YA book trends, and more.

Among other things, I learned that:

  • Goodreads users don’t love-rate or hate-rate books.

  • Most books on Goodreads are written in English, but Arabic also appeared as a top 5 language.

  • “young-adult” is the 49th most popular tag.

  • YA books have been published as far back as -1750.

Working with this dataset has taught me so much about Goodreads, book trends, and data exploration. I’ve learned so many interesting things from this dataset. And best of all, I know which books to read next.

References

Zając, Zygmunt (2017, Sept. 19). goodbooks-10k. GitHub. https://github.com/zygmuntz/goodbooks-10k